International Journal of Engineering Technology and Management Sciences 
Website: ijetms.in Issue: 2 Volume No.7 March - April — 2023 
DOI:10.46647/ijetms.2023.v07i02.025 ISSN: 2581-4621 


Heart Disease Diagnosis and Prediction using Multi Linear Regression 


Shah Alam¹, Bhaskar Bakshi², Rupjit Maity³, Sulekha Das⁴, Dr. Avijit Kumar Chaudhuri⁵
¹UG-Computer Science and Engineering, Techno Engineering College Banipur
²UG-Information Technology, Techno Engineering College Banipur
³UG-Computer Science and Engineering, Techno Engineering College Banipur
⁴Assistant Professor, Computer Science and Engineering, Techno Engineering College Banipur
⁵Assistant Professor, Computer Science and Engineering, Techno Engineering College Banipur
¹Orcid Id: 0009-0009-6711-6247, ²Orcid Id: 0009-0003-3002-6324
³Orcid Id: 0009-0002-9406-7291, ⁴Orcid Id: 0000-0002-6641-3268
⁵Orcid Id: 0000-0002-5310-3180


ABSTRACT 

Correct prediction of heart disease can prevent life-threatening events, while incorrect prediction
can prove fatal. In this paper, a machine learning algorithm is applied to analyze a primary dataset
and compare the results. The dataset consists of 46 attributes, from which Information Gain is used
to select 24 features for the analysis. The results are validated using accuracy and the confusion
matrix. Irrelevant features in the dataset are handled, and the data are normalized to obtain better
results. Using the machine learning approach, 77.78% accuracy was obtained. Multiple linear
regression is used to construct and validate the prediction system. Our experimental results show
that multiple linear regression is suitable for modelling and predicting cholesterol.

Keywords- Primary Dataset, Information Gain, 24 Attributes, Analysis, Multiple Linear 
Regression 


INTRODUCTION 

Heart disease remains the main cause of death worldwide, including in India, and detection at an
earlier stage can help prevent heart attacks. Medical practitioners generate data containing a wealth
of hidden information that is not being used effectively for prediction. For this purpose, this
research converts the unused data into a dataset for modelling using data mining techniques. People
die after experiencing symptoms that were not taken into consideration. There is a need for medical
practitioners to predict heart disease before it occurs in their patients. The factors that increase
the possibility of heart attacks are smoking, lack of physical exercise, high blood pressure, high
cholesterol, unhealthy diet, harmful use of alcohol, and high sugar levels. Cardiovascular Disease
(CVD) includes coronary heart disease, cerebrovascular disease (stroke), hypertensive heart disease,
congenital heart disease, peripheral artery disease, rheumatic heart disease, and inflammatory heart
disease. Data mining is a knowledge discovery technique used to analyze data and condense it into
useful information. The current research intends to predict the probability of heart disease given a
patient dataset. Prediction and description are the principal goals of data mining in practice.
Description focuses on discovering patterns that explain the data in a form humans can interpret.
The purpose of prediction in data mining is to help discover trends in patient data in order to
improve patients' health. Due to changes in lifestyle in developing countries such as South Africa,
CVD has become a leading cause of death. CVD is projected to be the single largest killer worldwide.
An endeavor to exploit knowledge, experience, and clinical screening of patients to diagnose or
recognize heart attacks is regarded as a valuable opportunity. In the health sector, data mining
plays an important role in predicting diseases. The predictive end product of this research is a data
mining model.
In this paper, a machine learning method is applied to investigate information regarding heart
disease and to assess the predictive power of the method. To this aim, a Multiple Linear Regression
algorithm is first developed to predict high cholesterol and high blood pressure at an early stage.
As the feature selection algorithm can affect the performance of the Multiple Linear Regression
model, Information Gain is used to optimize the model. This enables the model to achieve better
accuracy in the prediction and prognosis stages. In addition, the values of the coefficients
$\beta_0, \beta_1, \beta_2, \dots, \beta_n$ in the Multiple Linear Regression algorithm are determined
experimentally using an iterative approach. Finally, the performance of the proposed algorithm is
assessed when it is applied to a heart disease database.


Literature Review

Machine learning involves several algorithms such as Regression Analysis (Linear, Multiple,
Logistic), decision trees, random forests, k-Nearest Neighbors (KNN), support vector machine
(SVM), Naive Bayes (NBs), classification tree (C4.5), gradient boosting machines (GBM), etc. While
each of these algorithms processes data differently, in this section a few recently proposed machine
learning approaches in the area of heart disease prediction are reviewed.

Predictive Analysis of Heart Diseases with Machine Learning Approaches

Reference | Method | Key Findings | Dataset | Challenges
[1] | Feature selection algorithm (FCMIM), SVM | Improved accuracy results for heart disease dataset | Heart disease dataset (Cleveland) | Performs better only for small datasets
[2] | Hybrid Machine Learning, Hybrid Random Forest | Better accuracy (87.8 %) | Cardiovascular disease dataset | Limited features
[3] | IoT, Machine Learning Methods, SVM | Accuracy (97.5 %) | Heart Disease dataset | Limited features
[4] | Various Machine Learning classification techniques and Principal component analysis have been used to anticipate heart disease | Better measurements and select characteristics results | Hungarian-Cleveland data | Dimension issues, accuracy
[5] | Naive Bayes and SVM were used as classifiers | Classification of heart disease dataset, cause of heart disease, diabetes | Kaggle dataset | Features selection and classification performs slower
[6] | k-NN algorithm | Feature selection, Classification | Kaggle dataset | Feature categorization can be improved

Overall, these studies demonstrate the utility of various machine learning and statistical algorithms 
in investigating the complex relationships between risk factors and heart disease, and in predicting 
the risk of adverse outcomes. By incorporating these advanced analytical tools into clinical practice, 
researchers and clinicians can better identify and manage high-risk patients, ultimately leading to 
improved cardiovascular health outcomes. 


The Proposed Approach

The proposed methodology is an enhancement of the Multiple Linear Regression method, evaluated with
a train-test partition, combined with Random Forest evaluated with 10-fold cross-validation. This
section briefly provides background for both methods.


Multiple Linear Regression 


Multiple linear regression is a statistical method used to model the relationship between a dependent 
variable and multiple independent variables. The basic idea is to use the values of the independent 
variables to predict the value of the dependent variable. 

The multiple linear regression model can be represented mathematically as: 

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$$

Where:

$Y$ is the dependent variable or the response variable that we want to predict

$X_1, X_2, \dots, X_n$ are the independent variables or predictors

$\beta_0, \beta_1, \beta_2, \dots, \beta_n$ are the coefficients or the model parameters that determine the
relationship between the independent variables and the dependent variable

$\varepsilon$ is the error term, which represents the random variability that cannot be explained by the
independent variables.
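
As a minimal sketch of how the coefficients of such a model can be estimated iteratively, the following Python example fits a multiple linear regression by batch gradient descent on the mean squared error. The function name, the learning rate, the iteration count, and the toy data are assumptions made for illustration; they are not taken from the paper's implementation.

```python
def fit_linear_regression(X, y, lr=0.01, n_iter=20000):
    """Estimate [b0, b1, ..., bn] for y ~ b0 + b1*x1 + ... + bn*xn.

    Uses batch gradient descent on the mean squared error.
    lr and n_iter are illustrative defaults, not tuned values.
    """
    n_samples, n_features = len(X), len(X[0])
    beta = [0.0] * (n_features + 1)  # beta[0] is the intercept
    for _ in range(n_iter):
        grad = [0.0] * (n_features + 1)
        for row, target in zip(X, y):
            pred = beta[0] + sum(b * x for b, x in zip(beta[1:], row))
            err = pred - target
            grad[0] += err
            for j, x in enumerate(row, start=1):
                grad[j] += err * x
        # Step against the average gradient
        beta = [b - lr * g / n_samples for b, g in zip(beta, grad)]
    return beta


# Toy data generated from y = 1 + 2*x1 + 3*x2 (hypothetical values)
X = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in X]

beta = fit_linear_regression(X, y)
print([round(b, 3) for b in beta])  # should be close to the generating coefficients
```

On noise-free data such as this, the estimated coefficients approach the generating values; with real data, the residual corresponds to the error term of the model.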


Random Forest 

Random Forest is a machine learning algorithm that is commonly used for classification and 
regression tasks. It is an ensemble method that combines multiple decision trees and generates a more 
accurate and robust prediction. 

The algorithm works by creating a large number of decision trees (also called "forest"), each of which 
is trained on a random subset of the features and a random sample of the training data. This process 
is called "bagging" (short for "bootstrap aggregating"). Each decision tree is constructed by 
recursively splitting the data based on the feature that provides the most information gain or the best 
split according to some criterion (e.g., Gini impurity or information gain). Once all the decision trees 
have been constructed, the prediction of the random forest is obtained by averaging the output of all 
the individual trees (in the case of regression) or by taking a majority vote (in the case of 
classification). 
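
The two ingredients described above, bootstrap sampling of the training data and combining the outputs of the individual trees, can be sketched in a few lines of Python. This is only an illustration of the aggregation step under simple assumptions: the helper names and the example tree outputs are hypothetical, and no actual decision trees are trained here.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the 'bootstrap' in bagging)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Combine per-tree class predictions into one forest prediction."""
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Combine per-tree numeric predictions (regression case)."""
    return sum(predictions) / len(predictions)


rng = random.Random(0)          # fixed seed so the sketch is repeatable
rows = list(range(10))          # stand-in for ten training records
sample = bootstrap_sample(rows, rng)

tree_votes = ["high", "normal", "high", "high", "normal"]  # hypothetical outputs of five trees
forest_class = majority_vote(tree_votes)
forest_value = average([210.0, 190.0, 200.0])              # hypothetical regression outputs

print(sample, forest_class, forest_value)
```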

For classification, Gini impurity is used as the default split criterion, and entropy can be set as
a substitute. For regression, mean squared error is used to determine variance reduction. In
Scikit-learn, variance reduction can also be calculated using mean absolute error [37].

$$\text{Gini Impurity} = 1 - \text{Gini} \quad (5.1)$$

$$\text{Gini} = P_1^2 + P_2^2 + P_3^2 + \dots + P_n^2 \quad (5.2)$$

Equation (5.1) gives the Gini impurity, where $P_1, \dots, P_n$ are the probabilities of each possible
class in the solution space, Gini represents the purity, and Gini Impurity represents the impurity of
a particular node. Gini applies only to categorical targets.
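
As a small worked example of the Gini impurity of a node, the following Python sketch computes the sum of squared class proportions and subtracts it from one. The function name and the example labels are assumptions chosen for illustration.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_i^2) over the class proportions in a node."""
    n = len(labels)
    gini = sum((count / n) ** 2 for count in Counter(labels).values())
    return 1.0 - gini


pure_node = ["high", "high", "high", "high"]       # one class only
mixed_node = ["high", "high", "normal", "normal"]  # two classes, evenly split

print(gini_impurity(pure_node))   # a pure node has no impurity
print(gini_impurity(mixed_node))  # an even two-class split gives 0.5
```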

Random Forest has several advantages over individual decision trees. First, it reduces overfitting, 
which is a common problem with decision trees, by combining the predictions of multiple trees. 
Second, it is relatively insensitive to the choice of hyperparameters, such as the maximum depth of 
the trees, because the ensemble approach smooths out the noise in each individual tree. Third, it can 
handle a large number of features and a large training set, making it suitable for high-dimensional 
datasets. 

Random Forest is widely used in various applications, such as image classification, speech 
recognition, credit scoring, and drug discovery, to name a few. Its flexibility, accuracy, and ease of 
use make it one of the most popular machine learning algorithms. 


Feature Selection

The first goal of the proposed feature selection method is to reach at least the same accuracy as
that obtained with the full feature set. The second goal is to improve the accuracy. Gathering
extensive information on all features is costly in terms of both time and money, and the extra
information also wastes time during classification and diagnosis. It is therefore better to reduce
the number of features in order to get a faster response and to find a better correlation between
the features and the outcomes.

The Information Gain algorithm is a technique to select the best features. 


Information Gain Algorithm 

The information gain algorithm is a popular method used in decision tree learning and feature 
selection. It is used to determine the relevance of a feature or attribute in predicting a target variable 
in a dataset. 

The basic idea of the information gain algorithm is to calculate the amount of information provided 
by each attribute in the dataset and then select the attribute that provides the most information about 
the target variable. 

Here's how the algorithm works: 

Calculate the entropy of the target variable. Entropy is a measure of the randomness or uncertainty in 
the target variable. It is calculated as: 

$$H(Y) = -\sum_{y} p(y) \log_2 p(y)$$

where p(y) is the proportion of samples that belong to class y. 

For each attribute in the dataset, calculate the entropy of the target variable after splitting the dataset 
based on the values of the attribute. This is called the conditional entropy and is calculated as: 
$$H(Y \mid X) = \sum_{x} p(x)\, H(Y \mid X = x)$$

where p(x) is the proportion of samples with attribute value x and H(Y|X=x) is the entropy of the 
target variable for the samples with attribute value x. 

Calculate the information gain for each attribute as the difference between the entropy of the target 
variable before and after splitting the dataset based on the attribute. This is calculated as: 

IG(X) = H(Y) - H(Y|X) 

Select the attribute with the highest information gain as the next node in the decision tree. 

The idea behind the information gain algorithm is that the attribute with the highest information gain 
provides the most information about the target variable and is therefore the most useful attribute for 
predicting the target variable. By recursively applying this algorithm to the remaining attributes, a 
decision tree can be constructed that can be used for classification or regression tasks. 
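
The entropy, conditional entropy, and information gain steps described in this section can be sketched in Python as follows. The function names, the toy attribute values, and the ranking helper are assumptions for illustration; the paper itself reports using Weka for this step.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """H(Y): entropy of the class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG(X) = H(Y) - H(Y|X) for one categorical attribute."""
    groups = defaultdict(list)
    for x, y in zip(feature_values, labels):
        groups[x].append(y)
    n = len(labels)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(columns, labels):
    """Sort attribute names by information gain, highest first."""
    gains = {name: information_gain(values, labels) for name, values in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)


# Hypothetical toy data: four people, two attributes, one target
labels = ["high", "high", "normal", "normal"]
columns = {
    "smoker": ["yes", "yes", "no", "no"],        # separates the classes perfectly
    "likes_pizza": ["yes", "no", "yes", "no"],   # carries no information here
}

ranking = rank_features(columns, labels)
print(ranking)
```

Keeping only the top-ranked attributes from such a ranking is one simple way to reduce a large attribute set to a smaller feature subset.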

However, the information gain algorithm has some limitations. For example, it may suffer from 
overfitting if the number of attributes is large compared to the number of samples in the dataset. To 
overcome this problem, various techniques such as pruning and regularization can be used. 
METHODOLOGY
In this paper, we work with the following data fields.


Table 1. Data Fields

Attributes | Description
Age | Age of the person
Mode of Transport | Type of vehicle used
Do you take medicines for Diabetes? | Does the person take medicine for diabetes?
Time to stay beyond office hours | How many hours does the person stay in the office beyond office hours?
Stops smoking recently? | Did the person quit smoking recently?
Are you trying to start walk or exercise? | Does the person exercise?
Do you take medicine for liver? | Does the person take medicine for the liver?
Food habit | Which type of food does the person eat?
Industry | Type of industry the person works in
Do you play games? | Does the person play football, cricket, etc.?
How many time you socialize through social media? | How often is the person active on Facebook, Instagram, etc.?
How frequently you check mail or/and WhatsApp? | How often does the person check mail and WhatsApp?
Are you trying to give up taking meat? | Is the person trying to give up eating meat?
Do you frequently take egg? | How many times does the person eat egg in a week?
Time to stay in office | How many hours does the person work in the office?
Do you take medicine for high blood pressure? | Does the person take medicine for high blood pressure?
Position in office | Name of the post the person holds in the office
Do you frequently take alcohol? | How many times does the person drink alcohol in a week?
Like to drink coke or similar? | Does the person frequently drink coke?
Do you like to take salt additionally in food you eat? | Does the person add extra salt to food?
Do you like to take ice-cream? | Does the person frequently eat ice-cream?
Go for walk? | Does the person walk frequently?
Smoker? | Is the person a smoker?
Like to eat pizza? | Does the person frequently eat pizza?
FLOW CHART

[Flow chart. Steps recoverable from the figure: load the dataset; preprocess the data; using the
feature selection algorithm, the "high blood pressure/high cholesterol" field is determined as the
dependent variable; Multiple Linear Regression, comparison, and determination of the accuracy rate;
split the dataset into 10 folds; for each fold in the dataset, train the random forest model,
predict target values for the held-out fold, and evaluate model performance; compute the average
performance of the model; tune hyperparameters; train the final model with the optimal
hyperparameters; make predictions on new data.]


Using Weka (a popular open-source software tool for data mining and machine learning), we
processed the dataset for feature selection using Information Gain and selected the best features
for maximum accuracy.

We then applied the Multiple Linear Regression algorithm with a train-test partition of the data and
the Random Forest algorithm with 10-fold cross-validation.

Multiple linear regression is a statistical regression technique used to analyze the relationship
between one response variable (dependent variable) and two or more controlled variables
(independent variables). This method was selected for this analysis because there are several
controlled variables. In this analysis, the response variable is blood cholesterol only (Y).
Age (X1), food habit (X2), smoking status (X3), etc. are the controlled variables.


• Confusion Matrix

After finding the accuracy from the difference between the actual data and the calculated data, we
computed the confusion matrix [2]. In this confusion matrix, TP stands for 'TRUE POSITIVE', the
number of positive samples correctly classified as positive; TN stands for 'TRUE NEGATIVE', the
number of negative samples correctly classified as negative; FP stands for 'FALSE POSITIVE', meaning
that the actual value is negative but the predicted value is positive; and FN stands for 'FALSE
NEGATIVE', meaning that the actual value is positive but the predicted value is negative. We place
the TP, TN, FP, and FN values in a 2*2 matrix (mat1). After that, we compute the accuracy,
sensitivity, precision, recall, and specificity. This matrix contains all the raw information about
the predictions made by a classification model on a given dataset [3].

• Cross-Validation

After finding the accuracy from the difference between the actual data and the calculated data, we
performed cross-validation. In this process, we first divide the whole list into 10 sub-lists. Each
sub-list in turn is held out for testing; for each one we compute the accuracy and the confusion
matrix, and from these the sensitivity, precision, recall, and specificity.
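
The splitting step of 10-fold cross-validation can be sketched in Python as follows. This is a minimal illustration, not the paper's code: the function names are hypothetical, the sample count of 134 is an assumed value used only for the demonstration, and the records are split in their existing order (in practice they are usually shuffled first).

```python
def k_fold_indices(n_samples, k=10):
    """Split indices 0..n_samples-1 into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # the first `extra` folds get one more item
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validation_splits(n_samples, k=10):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    folds = k_fold_indices(n_samples, k)
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


splits = list(cross_validation_splits(134, k=10))  # 134 is an assumed sample count
fold_sizes = [len(test_idx) for _, test_idx in splits]
print(fold_sizes)
```

For each pair, a model would be trained on the training indices and evaluated on the held-out indices, and the per-fold metrics would then be averaged.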

ACCURACY: The ratio of correctly labeled subjects to the whole pool of subjects. Accuracy is the
most intuitive measure.

PRECISION: The ratio of subjects correctly labeled positive by our program to all subjects labeled positive.
RECALL: Out of all actual positives, the percentage that are predicted positive.
SPECIFICITY: The number of correct negative predictions divided by the total number of actual
negatives.

• $\text{ACCURACY} = \frac{TP + TN}{TP + TN + FP + FN} \times 100$

• $\text{PRECISION} = \frac{TP}{TP + FP} \times 100$

• $\text{RECALL} = \frac{TP}{TP + FN} \times 100$

• $\text{SPECIFICITY} = \frac{TN}{TN + FP} \times 100$
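
The four measures can be computed directly from the confusion matrix counts. The following Python sketch is an illustration under stated assumptions: the function name is hypothetical and the counts passed to it are illustrative values for a two-class problem.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall, and specificity as percentages."""
    total = tp + tn + fp + fn
    return {
        "accuracy": 100.0 * (tp + tn) / total,
        "precision": 100.0 * tp / (tp + fp),
        "recall": 100.0 * tp / (tp + fn),
        "specificity": 100.0 * tn / (tn + fp),
    }


# Illustrative counts for a two-class confusion matrix
metrics = classification_metrics(tp=22, tn=24, fp=7, fn=7)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Note that recall and sensitivity are the same quantity, so only one of them needs to be computed.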


RESULT & DISCUSSION
Table 2. Accuracy of difference between actual data and calculated data

Training data share | Accuracy (%)
90% (0.9) | 63.64
80% (0.8) | 69.23
75% (0.75) | 80.95
66% (0.66) | 76.67


Table 3. Confusion Matrix & Corresponding Result

Training data | Confusion Matrix | Accuracy (%) | Precision (%) | Recall (%) | Specificity (%)
66% | [[22 7] [7 24]] | 76.67 | 75.86 | 75.86 | 77.42
50% | [[29 8] [8 38]] | 80.72 | 78.38 | 78.38 | 82.61
80% | [[11 6] [6 16]] | 69.23 | 64.71 | 64.71 | 72.73
90% | [[6 4] [4 8]] | 63.64 | 60 | 60 | 66.67



Table 4. Accuracy for 10-fold cross-validation

Test case | Accuracy rate (%)
1 | 71.43
2 | 71.43
3 | 85.71
4 | 78.57
5 | 61.54
6 | 84.62
7 | 84.62
8 | 69.23
9 | 53.85
10 | 76.92
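
The cross-validation procedure ends by computing the average performance across folds. As a small arithmetic illustration, the Python sketch below averages the ten fold accuracies listed in Table 4; the variable names are assumptions, and the spread is included only as an optional descriptive statistic.

```python
from statistics import mean, pstdev

# Per-fold accuracy rates (%) as listed in Table 4
fold_accuracies = [71.43, 71.43, 85.71, 78.57, 61.54, 84.62, 84.62, 69.23, 53.85, 76.92]

average_accuracy = mean(fold_accuracies)
spread = pstdev(fold_accuracies)  # population standard deviation across folds

print(f"mean accuracy: {average_accuracy:.2f}%")
print(f"lowest fold: {min(fold_accuracies):.2f}%, highest fold: {max(fold_accuracies):.2f}%")
print(f"standard deviation: {spread:.2f}")
```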
Table 5. Results for 10-fold cross-validation

Fold | Confusion Matrix | Accuracy (%) | Precision (%) | Recall (%)
1 | [[9 0 0 0] [2 1 0 0] [1 0 0 0] [1 0 0 0]] | 71.43 | 66 | 71
2 | [[9 0 0 0] [2 1 0 0] [1 0 0 0] [0 1 0 0]] | 71.43 | 59 | 71
3 | [[9 0 0 0] [0 3 0 0] [1 0 0 0] [0 1 0 0]] | 85.71 | 74 | 86
4 | [[9 0 0 0] [1 2 0 0] [1 0 0 0] [0 1 0 0]] | 78.57 | 67 | 79
5 | [[8 1 0] [3 0 0] [1 0 0]] | 61.54 | 46 | 62
6 | [[9 0 0] [1 2 0] [0 1 0]] | 84.62 | 78 | 85
7 | [[9 0 0] [1 2 0] [1 0 0]] | 84.62 | 80 | 85
8 | [[9 0 0] [3 0 0] [1 0 0]] | 69.23 | 48 | 69
9 | [[7 2 0] [3 0 0] [1 0 0]] | 53.85 | 44 | 54
10 | [[9 0 0] [2 1 0] [0 1 0]] | 76.92 | 68 | 77


CONCLUSION


While many machine-learning methods are available in the literature, and their performance depends
on different aspects including the dataset they are applied to, in this paper a machine-learning
method, Multiple Linear Regression, was hybridized with the Information Gain feature-selection
algorithm to classify patients as having high blood pressure, high cholesterol, both, or neither.
The objective of using the Information Gain algorithm was to determine the best combination of
features that minimizes the overall miscalculation of the Multiple Linear Regression method.
Moreover, the best values of the coefficients in the Multiple Linear Regression algorithm were
determined using an algorithm coded in Python. It was shown that when the Multiple Linear
Regression method is hybridized with a feature selection algorithm, the classification accuracy
increases significantly. As mentioned before, 24 features were chosen via the Information Gain
algorithm.

Future work may involve the use of other machine-learning classification algorithms or other
population-based feature selection meta-heuristics, and a comparison of their performance with that
obtained by the proposed approach.